Final Project [WIP] Update 3

final_project
tidyverse
ggplot2
summarytools
lubridate
sf
Crime, Drug Usage and Demographics
Author

Saaradhaa M

Published

August 31, 2022

library(tidyverse)
library(ggplot2)
library(summarytools)
library(lubridate)
library(sf)
library(chisq.posthoc.test)
library(forcats)
library(mapview)

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

Introduction

*adding a new sentence here and resubmitting this to see if it works on the blog. *resubmit attempt 2.

Brooklyn Nine-Nine, which is about a police precinct in New York City, is one of my favourite TV shows. It made me want to learn more about actual crime statistics in the city.

Specifically, this project examines how crime intersects with other social issues. The disparities in arrest rates across demographic groups have been well-documented (Jahn et al., 2022; Schleiden et al., 2020). Past research has also revealed a positive association between crime and drug usage (Pierce et al., 2017).

Research Questions
  1. What is the demographic profile of those arrested for drug-related crimes? Does it differ from that of general offenders?
  2. Suppose you were walking through a borough in New York City, and you see syringe litter. Does this necessarily mean that the borough in question is unsafe – or is this merely a bias that we have? In simpler terms: where are syringes likely to be found? Is there an overlap with where crimes (especially drug-related crimes) occur?

Crime is operationalized as arrest data (New York Police Department, 2022a; New York Police Department, 2022b), and drug usage as syringe litter data (NYC Parks, 2022a). The data spans from 2017 to 2022. Codebooks are provided here (New York Police Department, 2022a) and here (NYC Parks, 2022b).

Read In Data

# read in 3 main datasets.
arrest_historic <- read_csv("~/Desktop/umass/601final/historic.csv", 
                   show_col_types = FALSE, 
                   col_names = c("del", "date", "del", "del", "del", "a_desc", "del", "a_offenselevel", "a_boro", "del", "del", "a_age", "a_sex", "a_race", "del", "del", "a_lat", "a_long", "del"), skip=1) %>% 
  select(!starts_with("del"))

arrest_ytd <- read_csv("~/Desktop/umass/601final/ytd.csv", 
                   show_col_types = FALSE, 
                   col_names = c("del", "date", "del", "del", "del", "a_desc", "del", "a_offenselevel", "a_boro", "del", "del", "a_age", "a_sex", "a_race", "del", "del", "a_lat", "a_long", "del"), skip=1) %>% 
  select(!starts_with("del"))
arrest_ytd$a_offenselevel <- na_if(arrest_ytd$a_offenselevel, "9")

syringe <- read_csv("~/Desktop/umass/601final/syringe.csv", show_col_types = FALSE, col_names = c("del", "gispropnum", "del", "date", "del", "del", "del", "del", "del", "del", "s_location", "s_groundsyringes", "s_kiosksyringes", "s_totalsyringes", "del", "s_kiosktype", "del", "s_boro", "del", "del", "s_kioskavailable", "del", "del"), skip=1) %>% 
  select(!starts_with("del")) %>%
  filter(s_totalsyringes > 0)

# make any implicit missing values explicit.
arrest_historic <- complete(arrest_historic)
arrest_ytd <- complete(arrest_ytd)
syringe <- complete(syringe)

# summarise the 3 datasets.
print(dfSummary(arrest_historic, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.30, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary

arrest_historic

Dimensions: 5308876 x 9
Duplicates: 389555
Variable Stats / Values Freqs (% of Valid) Graph Missing
date [character]
1. 01/20/2010
2. 05/13/2009
3. 03/07/2012
4. 02/11/2009
5. 02/01/2012
6. 05/20/2010
7. 02/12/2009
8. 01/14/2009
9. 03/05/2010
10. 01/22/2010
[ 5834 others ]
1773(0.0%)
1772(0.0%)
1750(0.0%)
1738(0.0%)
1726(0.0%)
1725(0.0%)
1707(0.0%)
1701(0.0%)
1700(0.0%)
1694(0.0%)
5291590(99.7%)
0 (0.0%)
a_desc [character]
1. DANGEROUS DRUGS
2. ASSAULT 3 & RELATED OFFEN
3. OTHER OFFENSES RELATED TO
4. OTHER STATE LAWS
5. PETIT LARCENY
6. FELONY ASSAULT
7. VEHICLE AND TRAFFIC LAWS
8. DANGEROUS WEAPONS
9. CRIMINAL TRESPASS
10. MISCELLANEOUS PENAL LAW
[ 78 others ]
1099188(20.7%)
542463(10.2%)
304191(5.7%)
241696(4.6%)
234571(4.4%)
225812(4.3%)
210145(4.0%)
204763(3.9%)
202252(3.8%)
196255(3.7%)
1838371(34.7%)
9169 (0.2%)
a_offenselevel [character]
1. F
2. I
3. M
4. V
1475039(27.9%)
26488(0.5%)
3493645(66.1%)
293450(5.5%)
20254 (0.4%)
a_boro [character]
1. B
2. K
3. M
4. Q
5. S
1212924(22.8%)
1473010(27.7%)
1430633(26.9%)
1003502(18.9%)
188799(3.6%)
8 (0.0%)
a_age [character]
1. 25-44
2. 18-24
3. 45-64
4. <18
5. 65+
6. 895
7. 894
8. 935
9. 945
10. 928
[ 81 others ]
2482601(46.8%)
1375134(25.9%)
981931(18.5%)
422931(8.0%)
46084(0.9%)
13(0.0%)
7(0.0%)
7(0.0%)
7(0.0%)
5(0.0%)
139(0.0%)
17 (0.0%)
a_sex [character]
1. F
2. M
891469(16.8%)
4417407(83.2%)
0 (0.0%)
a_race [character]
1. AMERICAN INDIAN/ALASKAN N
2. ASIAN / PACIFIC ISLANDER
3. BLACK
4. BLACK HISPANIC
5. OTHER
6. UNKNOWN
7. WHITE
8. WHITE HISPANIC
11869(0.2%)
221280(4.2%)
2579848(48.6%)
428839(8.1%)
1363(0.0%)
51587(1.0%)
637703(12.0%)
1376387(25.9%)
0 (0.0%)
a_lat [numeric]
Mean (sd) : 40.8 (0.4)
min ≤ med ≤ max:
40.5 ≤ 40.7 ≤ 62.1
IQR (CV) : 0.1 (0)
105327 distinct values 1 (0.0%)
a_long [numeric]
Mean (sd) : -73.9 (0.1)
min ≤ med ≤ max:
-74.3 ≤ -73.9 ≤ -73.7
IQR (CV) : 0.1 (0)
104874 distinct values 1 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-09-01

print(dfSummary(arrest_ytd, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.30, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary

arrest_ytd

Dimensions: 93238 x 9
Duplicates: 10143
Variable Stats / Values Freqs (% of Valid) Graph Missing
date [character]
1. 06/01/2022
2. 05/04/2022
3. 05/05/2022
4. 03/02/2022
5. 06/07/2022
6. 02/09/2022
7. 04/06/2022
8. 03/10/2022
9. 03/16/2022
10. 05/11/2022
[ 171 others ]
748(0.8%)
747(0.8%)
738(0.8%)
724(0.8%)
717(0.8%)
716(0.8%)
713(0.8%)
712(0.8%)
710(0.8%)
709(0.8%)
86004(92.2%)
0 (0.0%)
a_desc [character]
1. ASSAULT 3 & RELATED OFFEN
2. PETIT LARCENY
3. FELONY ASSAULT
4. MISCELLANEOUS PENAL LAW
5. DANGEROUS DRUGS
6. CRIMINAL MISCHIEF & RELAT
7. ROBBERY
8. GRAND LARCENY
9. DANGEROUS WEAPONS
10. BURGLARY
[ 53 others ]
15402(16.5%)
9944(10.7%)
9128(9.8%)
6187(6.6%)
5256(5.6%)
5192(5.6%)
4839(5.2%)
4035(4.3%)
3917(4.2%)
3171(3.4%)
26167(28.1%)
0 (0.0%)
a_offenselevel [character]
1. F
2. I
3. M
4. V
41648(45.2%)
145(0.2%)
49975(54.3%)
343(0.4%)
1127 (1.2%)
a_boro [character]
1. B
2. K
3. M
4. Q
5. S
21198(22.7%)
25431(27.3%)
22997(24.7%)
19416(20.8%)
4196(4.5%)
0 (0.0%)
a_age [character]
1. <18
2. 18-24
3. 25-44
4. 45-64
5. 65+
3384(3.6%)
16255(17.4%)
53758(57.7%)
18491(19.8%)
1350(1.4%)
0 (0.0%)
a_sex [character]
1. F
2. M
16269(17.4%)
76969(82.6%)
0 (0.0%)
a_race [character]
1. AMERICAN INDIAN/ALASKAN N
2. ASIAN / PACIFIC ISLANDER
3. BLACK
4. BLACK HISPANIC
5. UNKNOWN
6. WHITE
7. WHITE HISPANIC
239(0.3%)
5006(5.4%)
46114(49.5%)
8206(8.8%)
331(0.4%)
9962(10.7%)
23380(25.1%)
0 (0.0%)
a_lat [numeric]
Mean (sd) : 40.7 (0.1)
min ≤ med ≤ max:
40.5 ≤ 40.7 ≤ 40.9
IQR (CV) : 0.1 (0)
18038 distinct values 0 (0.0%)
a_long [numeric]
Mean (sd) : -73.9 (0.1)
min ≤ med ≤ max:
-74.3 ≤ -73.9 ≤ -73.7
IQR (CV) : 0.1 (0)
18022 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-09-01

print(dfSummary(syringe, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.30, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary

syringe

Dimensions: 17531 x 9
Duplicates: 1649
Variable Stats / Values Freqs (% of Valid) Graph Missing
gispropnum [character]
1. M037
2. X045
3. X010A
4. X155
5. X200
6. X017
7. X042
8. X001A
9. X010
10. X255
[ 232 others ]
4087(23.4%)
3279(18.8%)
1580(9.0%)
1155(6.6%)
867(5.0%)
839(4.8%)
500(2.9%)
406(2.3%)
322(1.8%)
245(1.4%)
4202(24.0%)
49 (0.3%)
date [character]
1. 01/22/2019 12:00:00 AM
2. 01/03/2020 12:00:00 AM
3. 01/19/2022 12:00:00 AM
4. 04/15/2019 12:00:00 AM
5. 04/17/2019 12:00:00 AM
6. 10/22/2018 12:00:00 AM
7. 01/14/2019 12:00:00 AM
8. 04/04/2019 12:00:00 AM
9. 04/16/2019 12:00:00 AM
10. 12/15/2021 12:00:00 AM
[ 1609 others ]
81(0.5%)
41(0.2%)
39(0.2%)
39(0.2%)
39(0.2%)
39(0.2%)
38(0.2%)
38(0.2%)
38(0.2%)
37(0.2%)
17102(97.6%)
0 (0.0%)
s_location [character]
1. Highbridge Park
2. St. Mary's Park
3. Walter Gladwin Park
4. Patterson Playground
5. Clark Playground
6. Richman (Echo) Park
7. Rose Hill Park
8. Aqueduct Walk
9. Crotona Park
10. Bill Rainey Park
[ 224 others ]
4087(23.4%)
3279(18.8%)
1580(9.0%)
1155(6.6%)
867(5.0%)
839(4.8%)
500(2.9%)
491(2.8%)
322(1.8%)
245(1.4%)
4117(23.5%)
49 (0.3%)
s_groundsyringes [numeric]
Mean (sd) : 31.7 (95.3)
min ≤ med ≤ max:
1 ≤ 11 ≤ 7000
IQR (CV) : 24 (3)
328 distinct values 3284 (18.7%)
s_kiosksyringes [numeric]
Mean (sd) : 32.5 (73)
min ≤ med ≤ max:
1 ≤ 10 ≤ 1000
IQR (CV) : 27 (2.2)
110 distinct values 13977 (79.7%)
s_totalsyringes [numeric]
Mean (sd) : 32.4 (93.1)
min ≤ med ≤ max:
1 ≤ 11 ≤ 7000
IQR (CV) : 25 (2.9)
364 distinct values 0 (0.0%)
s_kiosktype [character]
1. Bathroom
2. Mailbox
921(25.4%)
2702(74.6%)
13908 (79.3%)
s_boro [character]
1. Bronx
2. Manhattan
3. Queens
4. Staten Island
12419(71.0%)
5038(28.8%)
9(0.1%)
16(0.1%)
49 (0.3%)
s_kioskavailable [logical]
1. FALSE
2. TRUE
3409(19.9%)
13727(80.1%)
395 (2.3%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-09-01

#read in shapefile for syringe dataset.
park <- st_read("_data/ParksProperties", quiet = TRUE) %>% select("gispropnum", "geometry")

In the code chunk above, all datasets have been loaded, unnecessary columns removed, columns renamed where needed and all missing values labelled NA. Some missing values were labelled “9” in arrest_ytd and have also been converted to NA.

arrest_historic contains 5,308,876 rows, while arrest_ytd contains 93,238 rows. Each row represents a single arrest and associated information (e.g., description of crime, arrest location, demographic profile of suspect). Both have the same 9 columns and only differ in time frame - arrest_historic contains data from Jan 2006 to Dec 2021, while arrest_ytd contains data from Jan to Jun 2022.

I will interpret their descriptive statistics again after combining them, but these are some initial observations:

  • Dangerous drugs are in the list of top 5 offenses in both datasets, highlighting the importance of our research question.

  • Arrests seem to occur the least often in Staten Island - it would be interesting to see if this is where the least number of syringes are collected as well.

  • Black individuals represent ~50% of all arrests in both datasets, despite only being ~20% of the population in NYC (U.S. Census Bureau, 2022).

Meanwhile, syringe contains 17,531 rows and 9 columns. Rows where s_totalsyringes were 0 were removed in the read-in step, as they do not contain meaningful information for analysis. The dataset ranges from Jan 2017 to Jul 2022. It contains information on syringe collection, including date, location and number of syringes. Each row does not yet represent a single date/location combination where syringes were collected, so it needs to be tidied before descriptives can be interpreted. The one thing that stands out to me is that no syringes were collected in Brooklyn - hopefully the analysis later can unpack why this might be the case.

I have also loaded the park shapefile (Department of Parks and Recreation, 2022), which will be used derive location coordinates for syringe.

Tidy Data

Steps to tidy arrest_historic and arrest_ytd:

  1. Convert date to “date” column type.

  2. Create a separate column, year.

  3. Remove rows before 2017 in arrest_historic.

  4. Join both datasets into arrest dataset (93,238 + 1,043,535 = 1,136,773 rows; 9 + 1 = 10 columns).

# change 'date' to date column type.
arrest_historic <- arrest_historic %>% mutate(date = as_date(parse_date_time(date, c('mdy'))))

arrest_ytd <- arrest_ytd %>% mutate(date = as_date(parse_date_time(date, c('mdy'))))
# create 'year'.
arrest_historic$"year" <- year(arrest_historic$date)

arrest_ytd$"year" <- year(arrest_ytd$date)

# remove rows before 2017 in arrest_historic.
arrest_historic <- arrest_historic %>% filter(year > 2016)

# join datasets.
arrest <- rbind(arrest_historic, arrest_ytd)

# spell out borough names in full.
arrest$a_boro <- recode(arrest$a_boro, B = 'Bronx', S = 'Staten Island', K = 'Brooklyn', M = 'Manhattan', Q = 'Queens')

# sanity check.
print(dfSummary(arrest, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.30, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary

arrest

Dimensions: 1136773 x 10
Duplicates: 97724
Variable Stats / Values Freqs (% of Valid) Graph Missing
date [Date]
min : 2017-01-01
med : 2019-02-25
max : 2022-06-30
range : 5y 5m 29d
2007 distinct values 0 (0.0%)
a_desc [character]
1. ASSAULT 3 & RELATED OFFEN
2. DANGEROUS DRUGS
3. PETIT LARCENY
4. FELONY ASSAULT
5. VEHICLE AND TRAFFIC LAWS
6. MISCELLANEOUS PENAL LAW
7. ROBBERY
8. GRAND LARCENY
9. DANGEROUS WEAPONS
10. CRIMINAL MISCHIEF & RELAT
[ 75 others ]
166190(14.6%)
129285(11.4%)
102593(9.0%)
83592(7.4%)
67024(5.9%)
62845(5.5%)
48499(4.3%)
47368(4.2%)
46133(4.1%)
32889(2.9%)
348324(30.7%)
2031 (0.2%)
a_offenselevel [character]
1. F
2. I
3. M
4. V
436854(38.7%)
2838(0.3%)
675669(59.9%)
12874(1.1%)
8538 (0.8%)
a_boro [character]
1. Bronx
2. Brooklyn
3. Manhattan
4. Queens
5. Staten Island
255022(22.4%)
312014(27.4%)
288631(25.4%)
232670(20.5%)
48436(4.3%)
0 (0.0%)
a_age [character]
1. <18
2. 18-24
3. 25-44
4. 45-64
5. 65+
52411(4.6%)
239476(21.1%)
606859(53.4%)
223661(19.7%)
14366(1.3%)
0 (0.0%)
a_sex [character]
1. F
2. M
200375(17.6%)
936398(82.4%)
0 (0.0%)
a_race [character]
1. AMERICAN INDIAN/ALASKAN N
2. ASIAN / PACIFIC ISLANDER
3. BLACK
4. BLACK HISPANIC
5. UNKNOWN
6. WHITE
7. WHITE HISPANIC
3094(0.3%)
61056(5.4%)
548316(48.2%)
98725(8.7%)
7271(0.6%)
133286(11.7%)
285025(25.1%)
0 (0.0%)
a_lat [numeric]
Mean (sd) : 40.7 (0.1)
min ≤ med ≤ max:
40.5 ≤ 40.7 ≤ 40.9
IQR (CV) : 0.1 (0)
93984 distinct values 0 (0.0%)
a_long [numeric]
Mean (sd) : -73.9 (0.1)
min ≤ med ≤ max:
-74.3 ≤ -73.9 ≤ -73.7
IQR (CV) : 0.1 (0)
93555 distinct values 0 (0.0%)
year [numeric]
Mean (sd) : 2018.9 (1.6)
min ≤ med ≤ max:
2017 ≤ 2019 ≤ 2022
IQR (CV) : 3 (0)
2017:286225(25.2%)
2018:246773(21.7%)
2019:214617(18.9%)
2020:140413(12.4%)
2021:155507(13.7%)
2022:93238(8.2%)
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-09-01

The sanity check demonstrates that steps 1 to 4 are complete.

Steps to tidy syringe:

  1. Convert date to “date” type.

  2. Sum syringe count values in syringe into 1 row per date/location combination.

  3. Extract latitude and longitude columns from park shapefile and convert them into a dataframe.

  4. Combine that dataframe with syringe.

# remove timestamp from 'date'.
syringe$date <- str_remove(syringe$date, "12:00:00 AM")

# convert 'date' to date column type.
syringe <- syringe %>% mutate(date = as_date(parse_date_time(date, c('mdy'))))

# sum values into 1 row per date/location combo.
syringe <- syringe %>% 
  group_by(s_location, date, gispropnum, s_kiosktype, s_boro, s_kioskavailable) %>%
summarise(across(contains("syringes"), ~sum(.x, na.rm = TRUE)))

# extract latitude and longitude from park shapefile into park_ll.
sf_use_s2(FALSE)
park_ll <- st_coordinates(st_centroid(park$geometry))

# convert park and park_ll to dataframes.
park_ll <- as_tibble(park_ll)
park <- as_tibble(park)

# remove multipolygon column from park.
park <- select(park, -2)

# combine park and park_ll.
park <- cbind(park, park_ll)

# rename long and lat columns in park.
park <- rename(park, "s_long" = "X", "s_lat" = "Y")

# combine park and syringe.
syringe <- syringe %>%
  left_join(park, by = "gispropnum")

# sanity check.
head(syringe)

The sanity check shows that steps 1 to 3 are complete. For now, the basic tidying is complete. Further transformations may be required for each visualization as we go along.

Analysis: Question 1

To recap, this is my first research question:

Question 1

What is the demographic profile of those arrested for drug-related crimes? Does it differ from that of general offenders?

Age, sex and race data are available in the arrest dataset. We can combine them together into 1 column to create demographic profiles, then make bar graphs.

# combine demographics into 1 column.
arrest <- arrest %>% unite(demo, a_race, a_sex, a_age, sep = "_", remove = FALSE, na.rm = TRUE)

# create bar graph for all arrests.
arrest %>% filter(!is.na(demo)) %>%
  count(demo) %>%
  slice_max(n, n = 5) %>%
  ggplot(aes(x = reorder(demo,-n/1136773*100), y = n/1136773 * 100, fill = demo)) +
  geom_col(stat="identity") +
  theme_minimal() + 
  labs(title = "Top Demographic Profiles of All Offenders in NYC (Jan 2017 to Jun 2022)", x = "Profile", y = "Percent") +
  geom_text(aes(label=sprintf("%0.2f", ..y..)), position=position_dodge(width=0.9), vjust=-0.25, size=3) +
  theme(axis.text.x=element_text(angle=90,hjust=1))

# create bar graph for drug-related arrests.
arrest %>% filter(a_desc == "DANGEROUS DRUGS") %>% 
  filter(!is.na(demo)) %>%
  count(demo) %>% 
  slice_max(n, n = 5) %>%
  ggplot(aes(x = reorder(demo,-n/129285*100), y = n/129285 * 100, fill = demo)) +
  geom_col(stat="identity") +
  theme_minimal() + 
  labs(title = "Top Demographic Profiles of Drug Offenders in NYC (Jan 2017 to Jun 2022)", x = "Profile", y = "Percent") +
  geom_text(aes(label=sprintf("%0.2f", ..y..)), position=position_dodge(width=0.9), vjust=-0.25, size=3) +
  theme(axis.text.x=element_text(angle=90,hjust=1))

In the above graphs, each demographic profile is labelled with a different colour, so that we can see the differences in proportion clearly. In the duration of 5 years, the top 5 demographic profiles are the same for all arrests and drug-related arrests (except for a switch in order at the 3rd and 4th positions). This somewhat answers our research question:

Question 1

When comparing all arrests and drug-related arrests, the top five demographic profiles of offenders seem to be largely similar. They are all male. Age group varies based on ethnicity: black men aged 18 to 64 are all quite likely to be arrested, although this was greater for black men aged 25-44. For white Hispanic and white men, arrests are concentrated in the 25-44 age group.

However, both bar graphs seem to show a stark difference between arrests of black males aged 25-44 and the other four groups. I am interested to see if this difference is significant.

# create new dataset for chi-square test.
chi_square <- arrest %>% filter(!is.na(demo)) %>% filter(demo ==  "BLACK_M_18-24" | demo == "BLACK_M_25-44" | demo == "BLACK_M_45-64" | demo == "WHITE HISPANIC_M_25-44" | demo == "WHITE_M_25-44")

# run chi-square test.
chisq.test(x = table(chi_square$demo))

    Chi-squared test for given probabilities

data:  table(chi_square$demo)
X-squared = 140432, df = 4, p-value < 2.2e-16
# attempt post-hoc test - this is not working.
# table_demo <- chi_square %>% count(demo)
# chisq.posthoc.test(table_demo, method = "bonferroni")

The chi-square test reveals significant differences between the top 5 demographic profiles for drug-related and non-drug-related arrests, 𝜒2 (4) = 140432, p < .001. However, this doesn’t tell us which groups had the differences. I downloaded a post-hoc package, but I can’t get it to work.

Additional Analysis: Question 1

Doing the above analysis also made me interested in exploring a few other questions:

  • How did the proportion of drug-related arrests change over time?

  • Within drug-related arrests, how did the proportions of each demographic variable (age, sex, race) change over time?

# keep only top 3 levels of a_desc.
arrest$a_desc <- fct_lump_n(arrest$a_desc, 3, other_level="Other")

# change year to factor type so i can plot it.
arrest$year <- as.factor(arrest$year)

# stacked bar graph for offense type.
arrest %>% filter(!is.na(a_desc)) %>%
  ggplot() +
  geom_bar(mapping = aes(x = year, y=..count.., fill=a_desc), position="fill") +
  theme_minimal() + 
  labs(title = "Proportion of Different Offences in NYC (Jan 2017 to Jun 2022)", x = "Year", y = "Percent") +
   scale_y_continuous(labels = scales::percent) +
  theme(axis.text.x=element_text(angle=90,hjust=1)) 

# stacked bar graph for age.
arrest %>% filter(a_desc == "DANGEROUS DRUGS") %>%
  filter(!is.na(a_age)) %>%
  ggplot() +
  geom_bar(mapping = aes(x = year, fill=a_age), position="fill") +
  theme_minimal() + 
  labs(title = "Change in Age Composition of Drug Offenders in NYC (Jan 2017 to Jun 2022)", x = "Year", y = "Percent") +
   scale_y_continuous(labels = scales::percent) +
  theme(axis.text.x=element_text(angle=90,hjust=1))

# stacked bar graph for race.
arrest %>% filter(a_desc == "DANGEROUS DRUGS") %>%
  filter(!is.na(a_race)) %>%
  ggplot() +
  geom_bar(mapping = aes(x = year, fill=a_race), position="fill") +
  theme_minimal() + 
  labs(title = "Change in Race Composition of Drug Offenders in NYC (Jan 2017 to Jun 2022)", x = "Year", y = "Percent") +
   scale_y_continuous(labels = scales::percent) +
  theme(axis.text.x=element_text(angle=90,hjust=1)) 

# stacked bar graph for sex.
arrest %>% filter(a_desc == "DANGEROUS DRUGS") %>%
  filter(!is.na(a_sex)) %>% ggplot() +
  geom_bar(mapping = aes(x = year, fill=a_sex), position="fill") +
  theme_minimal() + 
  labs(title = "Change in Sex Composition of Drug Offenders in NYC (Jan 2017 to Jun 2022)", x = "Year", y = "Percent") +
   scale_y_continuous(labels = scales::percent) +
  theme(axis.text.x=element_text(angle=90,hjust=1))

This is what we can gather from the stacked bar graphs above. - The proportion of drug-related arrests has shrunk with time - this is a good thing! Although it is a part of the top 3 reasons for arrests, it is definitely on a downward trend. It will be useful to compare this with the change in syringes collected over time, to see if NYC’s stricter policies on drug usage and offences are effective. - I need to add percent labels to the demographic graphs to interpret them.

Analysis: Question 2

To recap, this is my second research question:

Question 2

Suppose you were walking through a borough in New York City, and you see syringe litter. Does this necessarily mean that the borough in question is unsafe – or is this merely a bias that we have? In simpler terms: where are syringes likely to be found? Is there an overlap with where crimes (especially drug-related crimes) occur?

This question can be easily visualized via dot maps of NYC.

# creating subset of syringe for visualisation.
syringe_map <- syringe %>% filter(!is.na(s_long)) %>% filter(!is.na(s_lat)) %>% group_by(s_totalsyringes)

# visualisation of syringe_map using mapview.
syringe %>% filter(!is.na(s_long)) %>% filter(!is.na(s_lat)) %>% group_by(s_totalsyringes) %>% mapview(xcol = "s_long", ycol = "s_lat", crs = 4326, grid = FALSE)
# creating arrest_map.
arrest_map <- arrest %>% count(a_boro)
arrest_map$lat <-c(40.84567, 40.67908, 40.78322, 40.72955, 40.58098)
arrest_map$long <- c(-73.86136, -73.94672, -73.97198, -73.79636, -74.15237)

# visualisation of arrest_map using mapview.
arrest_map %>% mapview(xcol = "long", ycol = "lat", crs = 4326, grid = FALSE, cex = "n", legend = TRUE, layer.name = "value")

I tested the packages ggmap and mapview, and both worked great for the syringe dataset. However, the arrest dataset is simply too large for either package to load the map successfully. I decided to create a new dataframe with just the numbers of arrests in each borough and load that into mapview instead, which worked. I also like mapview better because you can zoom in and out.

Additionally, I would like to make a time series graph of how syringes collected have changed over time. We can remove outliers by finding the standard deviation of s_totalsyringes. We can also make another stacked bar graph to compare the number of kiosk and ground syringes collected.

# time graph of syringes collected.
syringe %>% filter(!is.na(date)) %>% filter(!is.na(s_totalsyringes)) %>% filter(s_totalsyringes < 2000) %>%
  ggplot(aes(x=date, y=s_totalsyringes)) +
  geom_line() + 
  geom_point() +
  xlab("") +
  theme(axis.text.x=element_text(angle=60, hjust=1))

Just pushing this update to see if the rendering will work - I will clean up the text and code before my final submission.

Planning to make 2 stacked bar graphs (proportion of syringes/arrests in different boroughs) to check if what I observe from the maps can be seen from those bar graphs too.

Bibliography

Department of Parks and Recreation. (2022). Parks Properties. NYC Open Data. Retrieved from https://data.cityofnewyork.us/Recreation/Parks-Properties/enfh-gkve.

Jahn, J., Simes, J., Cowger, T., & Davis, B. (2022). Racial Disparities in Neighborhood Arrest Rates during the COVID-19 Pandemic. Journal Of Urban Health, 99(1), 67-76.

New York Police Department. (2022). NYPD Arrest Data (Year to Date). NYC Open Data. Retrieved from https://data.cityofnewyork.us/Public-Safety/NYPD-Arrest-Data-Year-to-Date-/uip8-fykc.

New York Police Department. (2022). NYPD Arrests Data (Historic). NYC Open Data. Retrieved from https://data.cityofnewyork.us/Public-Safety/NYPD-Arrests-Data-Historic-/8h9b-rp9u.

NYC Parks. (2022). NYC Parks Syringe Litter Data Dictionary and UserGuide. Google Docs. Retrieved from https://docs.google.com/spreadsheets/d/1VSUqd1peSc-4D2XnBZNiLdxa0Jg4z62D/edit#gid=150342678.

NYC Parks. (2022). Summary of Syringe Data in NYC Parks. NYC Open Data. Retrieved from https://data.cityofnewyork.us/Public-Safety/Summary-of-Syringe-Data-in-NYC-Parks/t8xi-d5wb.

Pierce, M., Hayhurst, K., Bird, S., Hickman, M., Seddon, T., Dunn, G., & Millar, T. (2017). Insights into the link between drug use and criminality: Lifetime offending of criminally-active opiate users. Drug And Alcohol Dependence, 179, 309-316.

R Core Team. (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing.

Schleiden, C., Soloski, K., Milstead, K., & Rhynehart, A. (2020). Racial Disparities in Arrests: A Race Specific Model Explaining Arrest Rates Across Black and White Young Adults. Child And Adolescent Social Work Journal, 37(1), 1-14.

U.S. Census Bureau. (2022). U.S. Census Bureau QuickFacts: New York City, New York. Census Bureau QuickFacts. Retrieved from https://www.census.gov/quickfacts/newyorkcitynewyork.

Wickham, H., & Grolemund, G. (2017). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data (1st ed.). O’Reilly Media.